My IODS project GitHub data repository
Welcome to my IODS 2019 course project page!
This project diary is basically me trying to learn how to do statistical analyses with R. I look forward to improving my R skills and learning some good open data practices.
What do I expect to learn?
Where did I learn about the course?
I think I just browsed the HYMY courses in WebOodi.
The data is from Kimmo Vehkalahti's study ASSIST 2014, which measured the approaches to learning of 183 university students on the Introduction to Social Statistics course in Fall 2014.
## [1] 166 7
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 7 variables:
Composite variables assessing the approaches to learning were formed by combining the items measuring each construct: the item mean was calculated for each participant. 166 cases were included in the analyses, as 17 cases with an exam score of 0 were omitted.
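The item-mean computation described above can be sketched in base R; the item columns `d1`-`d3` below are hypothetical names, not the actual questionnaire items:

```r
# Forming a composite score as the item mean per participant.
# Columns d1-d3 stand in for the real items measuring one construct.
items <- data.frame(d1 = c(4, 3), d2 = c(3, 2), d3 = c(4, 4))
items$deep <- rowMeans(items[, c("d1", "d2", "d3")])  # item mean per row
items$deep
```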
## # A tibble: 2 x 2
## gender n
## <fct> <int>
## 1 F 110
## 2 M 56
## [1] 25.51205
## [1] 17 55
There are 110 females and 56 males in the dataset and the ages of the participants range from 17 to 55 years with the average age being 25.5.
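These summaries can be reproduced with base R; the sketch below uses a three-row toy stand-in (`toy`) rather than the real 166-row data frame:

```r
# Toy stand-in for the analysis data: counts per gender, age range, mean age
toy <- data.frame(gender = factor(c("F", "F", "M")), age = c(17, 25, 55))
table(toy$gender)  # participants per gender
range(toy$age)     # youngest and oldest participant
mean(toy$age)      # average age
```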
p2 <- ggplot(data, aes(attitude)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 2)
p3 <- ggplot(data, aes(deep)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p4 <- ggplot(data, aes(stra)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p5 <- ggplot(data, aes(surf)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p6 <- ggplot(data, aes(points)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 2)
figure <- ggarrange(p2, p3, p4, p5, p6,
labels = c("attitude", "deep", "stra", "surf", "points"),
ncol = 3, nrow = 2)
figure
p7 <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p7
## vars n mean sd median trimmed mad min max range skew
## age 1 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## attitude 2 166 31.43 7.30 32.00 31.52 7.41 14.00 50.00 36.00 -0.08
## deep 3 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 4 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## surf 5 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## points 6 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## age 3.24 0.60
## attitude -0.48 0.57
## deep 0.66 0.04
## stra -0.45 0.06
## surf -0.27 0.04
## points -0.26 0.46
The other variables seem to be quite symmetrically distributed, with a slight negative skew in deep learning and exam points. By visual inspection, there might be a gender difference in attitude towards statistics, with men having a more positive attitude on average.
Attitude and exam points are moderately correlated. There are also a weak positive and a weak negative correlation between exam points and strategic learning, and between exam points and surface learning, respectively. Interestingly, deep learning is not correlated with exam points. Surface learning is also negatively correlated with deep and strategic learning as well as attitude (r = -.32 … -.16).
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
I chose attitude, strategic learning, and surface learning as predictors in my regression model based on the strength of their correlations with exam points.
The F-statistic tests the omnibus null hypothesis that all regression coefficients are zero. The F-statistic for Model 1 is 14.13 with a p-value less than .001. It is therefore highly unlikely that all regression coefficients in the model are zero, and we can reject the null hypothesis.
The square of the multiple correlation coefficient (R^2) is .207, which signifies that the variables in the model account for about 21% of the variation in exam points.
However, the non-significant t-values of strategic and surface learning imply that attitude is the only statistically significant predictor in the model (t = 5.91, p < .001). The t-test tests whether a regression coefficient differs from zero.
The unstandardized regression coefficients are reported under “Estimate” in the “Coefficients” table. The coefficient .34 (p < .001) of attitude describes the strength of the relationship between attitude and exam points in the original scales of the variables, when strategic and surface learning are controlled for. However, we cannot judge the relative importance of the predictors using unstandardized coefficients. We can obtain standardized values by multiplying the raw coefficient by the standard deviation of the explanatory variable and dividing by the standard deviation of the response variable:
attitude: 0.34 × 7.30 / 5.89 = 0.42
stra: 0.85 × 0.77 / 5.89 = 0.11
surf: -0.59 × 0.53 / 5.89 = -0.05
The standardized beta coefficient of attitude on exam points in this model is .42. Strategic (β = .11) and surface learning (β = -.05) did not statistically significantly predict exam points.
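The standardization arithmetic above can be checked directly; the raw coefficients and standard deviations below are copied from the model summary and the descriptives table:

```r
# Standardized beta = raw coefficient * SD(predictor) / SD(response)
sd_points <- 5.89
beta_attitude <-  0.33952 * 7.30 / sd_points
beta_stra     <-  0.85313 * 0.77 / sd_points
beta_surf     <- -0.58607 * 0.53 / sd_points
round(c(beta_attitude, beta_stra, beta_surf), 2)  # 0.42 0.11 -0.05
```

The same values could be obtained by refitting the model on scaled variables, e.g. `lm(scale(points) ~ scale(attitude) + scale(stra) + scale(surf), data = data)`.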
From the “Residuals” table, we can also see how the model residuals are distributed.
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In Model 2, the two non-significant predictors, strategic and surface learning, were omitted, and exam points were predicted by attitude alone. The null hypothesis is rejected with a statistically significant F-statistic of 38.61, p < .001.
Interpreting the model
The squared multiple correlation coefficient (R^2) in Model 2 is .19, which implies that attitude explains 19% of the variance in exam points. According to the t-test, the regression coefficient of attitude differs from 0 (t = 6.21, p < .001). The unstandardized regression coefficient of attitude on exam points is .35 (p < .001): when attitude towards statistics increases by 1 on the original scale, exam score increases by 0.35 on average. The standardized regression coefficient is 0.44, which implies that when attitude increases by 1 SD, exam points increase by 0.44 SD. Based on the results, we can say that students with a more positive attitude score higher on the course exam on average, but the effect is relatively weak.
Regression diagnostics
The regression model has several assumptions:
The Residuals vs. Fitted plot shows a linear relationship between the variables. The normality of the variables was examined before the analysis via plots and descriptive statistics. As there is only one predictor, there cannot be multicollinearity. The standardized residuals also seem to be normally distributed, as implied by the relatively straight line in the Normal Q-Q plot. The Scale-Location plot shows that the residuals are spread equally along the range of the predictor, which implies homoscedasticity. Lastly, the Residuals vs. Leverage plot shows no influential outliers.
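The four diagnostic plots discussed above come from R's `plot()` method for `lm` objects. A self-contained sketch, using simulated data on roughly the same scale as the real variables (the names `sim` and `fit` are illustrative):

```r
# Simulate attitude/points data resembling the real scales, fit the
# simple regression, and draw the four standard diagnostic plots
set.seed(1)
sim <- data.frame(attitude = rnorm(166, 31, 7))
sim$points <- 11.6 + 0.35 * sim$attitude + rnorm(166, 0, 5)
fit <- lm(points ~ attitude, data = sim)
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))  # Residuals vs Fitted, Normal Q-Q,
                                  # Scale-Location, Residuals vs Leverage
```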
The data is the Student Performance Data Set from the UCI Machine Learning Repository. It describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). A full description of the original data can be found here.
The two datasets have been merged using several variables as identifiers to combine each student's information: school, sex, age, address, family size, parents' cohabitation status, parents' education and jobs, reason to choose school, attendance at school nursery, and home internet access. If the student had answered the same question on both questionnaires, the rounded average was used. If the question was non-numeric, the answer in the Mathematics performance dataset was used.
The R script about creating the merged dataset can be found here.
## [1] X school sex age address famsize
## [7] Pstatus Medu Fedu Mjob Fjob reason
## [13] nursery internet guardian traveltime studytime failures
## [19] schoolsup famsup paid activities higher romantic
## [25] famrel freetime goout Dalc Walc health
## [31] absences G1 G2 G3 alc_use high_use
| Variables | Information |
|---|---|
| sex | student's sex (binary: ‘F’ - female or ‘M’ - male) |
| Medu | mother’s education (numeric: 0 none, 1 primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education) |
| failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
| absences | number of school absences (numeric: from 0 to 93) |
| high_use | high alcohol consumption (TRUE: the average of self-reported workday and weekend alcohol consumption greater than 2 on a scale 1 -very low - 5 very high, FALSE: the average 2 or lower) |
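Based on the description above, `alc_use` and `high_use` were presumably derived from the workday (`Dalc`) and weekend (`Walc`) consumption items; a sketch on a small example frame (`ex`), not the real data:

```r
# Derive average alcohol use and the high-use indicator
ex <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
ex$alc_use  <- (ex$Dalc + ex$Walc) / 2  # mean of workday and weekend use, scale 1-5
ex$high_use <- ex$alc_use > 2           # TRUE when the average exceeds 2
ex
```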
Let’s take a glimpse of the structure and dimensions of the subset of data we are using:
## Observations: 382
## Variables: 5
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...
As Medu is truly a categorical variable, not a numerical one, it will have to be recoded into a factor.
alc$Medu <- factor(alc$Medu)
levels(alc$Medu) <- c("None", "Primary", "5th to 9th", "Secondary", "Higher")

The purpose of my analysis is to study the relationships between high/low alcohol consumption and students' demographic, social, and school-related characteristics.
I chose the following variables to explain the students' alcohol consumption: student's sex, mother's education, class failures, and absences.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex* | 1 | 382 | 1.482 | 0.500 | 1 | 1.477 | 0.000 | 1 | 2 | 1 | 0.073 | -2.000 | 0.026 |
| Medu* | 2 | 382 | 3.806 | 1.086 | 4 | 3.892 | 1.483 | 1 | 5 | 4 | -0.384 | -1.037 | 0.056 |
| failures | 3 | 382 | 0.202 | 0.583 | 0 | 0.033 | 0.000 | 0 | 3 | 3 | 3.034 | 8.689 | 0.030 |
| absences | 4 | 382 | 4.500 | 5.463 | 3 | 3.536 | 2.965 | 0 | 45 | 45 | 3.187 | 16.186 | 0.279 |
| G3 | 5 | 382 | 11.458 | 3.310 | 12 | 11.631 | 2.965 | 0 | 18 | 18 | -0.459 | 0.180 | 0.169 |
p1 <- ggplot(alc, aes(x = sex, fill = sex))
p1 + geom_bar() + theme(legend.position = "none") + ggtitle("Students' sex") + ylab(" ") + xlab("Sex")

| sex | n |
|---|---|
| F | 198 |
| M | 184 |
p2 <- ggplot(alc, aes(x = high_use, fill = sex))
p2 + geom_bar() + theme(legend.position = "none") + facet_wrap(~sex) +
labs(title = "High alcohol consumption of female (left) and male (right) students",
caption = "TRUE = high consumption, FALSE = low consumption") +
ylab(" ") + xlab("High alcohol consumption")alc %>%
group_by(high_use, sex)%>%
summarise(n=n())%>%
spread(sex, n) %>%
kable(caption = "<b>Table 3</b> High/low alcohol consumption by students' sex crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| high_use | F | M |
|---|---|---|
| FALSE | 156 | 112 |
| TRUE | 42 | 72 |
Mother’s education
p3 <- ggplot(alc, aes(x = Medu))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Mother's education") + ylab(" ") + xlab("Education level of mother")

alc %>%
group_by(high_use, Medu)%>%
summarise(n=n())%>%
spread(Medu, n) %>%
kable(caption = "<b>Table 4</b> High/low alcohol consumption by mother's education crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| high_use | None | Primary | 5th to 9th | Secondary | Higher |
|---|---|---|---|---|---|
| FALSE | 1 | 33 | 80 | 59 | 95 |
| TRUE | 2 | 18 | 18 | 36 | 40 |
p3 <- ggplot(alc, aes(x = failures))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Class failures") + ylab(" ") + xlab("Number of class failures")

alc %>%
group_by(high_use, failures)%>%
summarise(n=n())%>%
spread(failures, n) %>%
kable(caption = "<b>Table 5</b> High/low alcohol consumption by mother's education crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| high_use | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| FALSE | 244 | 12 | 10 | 2 |
| TRUE | 90 | 12 | 9 | 3 |
p4 <- ggplot(alc, aes(x = absences))
p4 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Absences") + ylab(" ") + xlab("Number of absences")
p5 <- ggplot(alc, aes(x = high_use, y = absences, fill = sex))
p5 + geom_boxplot() + ggtitle("Boxplots of absences vs. high/low alcohol use by gender") + ylab("Number of absences") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences)) %>% kable(digits = 2, caption = "<b>Table 5</b> Mean number of absences by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| sex | high_use | count | mean_absences |
|---|---|---|---|
| F | FALSE | 156 | 4.22 |
| F | TRUE | 42 | 6.79 |
| M | FALSE | 112 | 2.98 |
| M | TRUE | 72 | 6.12 |
p6 <- ggplot(alc, aes(x = G3))
p6 + geom_histogram(color = "white", fill = "deepskyblue2", bins = 18) + theme_classic() + ggtitle("Grade") + ylab(" ") + xlab("Grade")
p7 <- ggplot(alc, aes(x = high_use, y = G3, fill = sex))
p7 + geom_boxplot() + ggtitle("Boxplots of grades vs. high/low alcohol use by gender") + ylab("Grade") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3)) %>% kable(digits = 3, caption = "<b>Table 6</b> Mean grade by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| sex | high_use | count | mean_grade |
|---|---|---|---|
| F | FALSE | 156 | 11.397 |
| F | TRUE | 42 | 11.714 |
| M | FALSE | 112 | 12.205 |
| M | TRUE | 72 | 10.278 |
# Fitting the logistic regression
mod1 <- glm(high_use ~ Medu + failures + absences + sex, data = alc, family = "binomial")
# Summary of the model
summary(mod1)
##
## Call:
## glm(formula = high_use ~ Medu + failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3466 -0.8419 -0.6057 1.0340 2.3030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07693 1.26355 0.061 0.9514
## MeduPrimary -1.67979 1.29798 -1.294 0.1956
## Medu5th to 9th -2.65576 1.29394 -2.052 0.0401 *
## MeduSecondary -1.72156 1.28764 -1.337 0.1812
## MeduHigher -2.00750 1.28668 -1.560 0.1187
## failures 0.43321 0.20127 2.152 0.0314 *
## absences 0.09627 0.02427 3.967 7.28e-05 ***
## sexM 0.97945 0.24932 3.928 8.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 412.83 on 374 degrees of freedom
## AIC: 428.83
##
## Number of Fisher Scoring iterations: 4
As Medu does not consistently predict high alcohol consumption (the only significant coefficient being the class “5th to 9th” [p = .040]), I omitted the variable from the model. This also makes the model easier to interpret.
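Dropping Medu could also be checked formally with a likelihood-ratio test between the nested models (not done in the original analysis). A sketch on simulated data, with the illustrative names `full` and `reduced`:

```r
# Likelihood-ratio (deviance) test for dropping a factor from a logistic model
set.seed(42)
toy <- data.frame(x = rnorm(200),
                  edu = factor(sample(c("Primary", "Secondary", "Higher"),
                                      200, replace = TRUE)))
toy$y <- rbinom(200, 1, plogis(-1 + 0.5 * toy$x))
full    <- glm(y ~ x + edu, data = toy, family = "binomial")
reduced <- glm(y ~ x,       data = toy, family = "binomial")
# Drop in deviance is approximately chi-square under the null
anova(reduced, full, test = "Chisq")
```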
# Fitting Model 2
mod2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# Summary of Model 2
summary(mod2)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## sexM 0.94117 0.24200 3.889 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
# Computing odds ratios and confidence intervals
OR <- coef(mod2) %>% exp
CI <- confint(mod2) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI) %>% kable(digits = 3, caption = "<b>Table 7</b> Odds ratios and their confidence intervals") %>% kable_styling(bootstrap_options = c("striped", "hover"))

| | OR | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.149 | 0.094 | 0.229 |
| failures | 1.570 | 1.083 | 2.295 |
| absences | 1.098 | 1.052 | 1.151 |
| sexM | 2.563 | 1.604 | 4.149 |
As we can see from the model summary, the intercept of high alcohol consumption is -1.90, which is more than 8 standard errors (the z value, or Wald test value) away from 0 on the standard normal curve, with a statistically significant p < .001. The slope coefficient of, for example, absences is 0.093: for a one-point increase in absences, the log odds of high alcohol consumption increase by 0.09. The z values of the coefficients of failures, absences, and sex are positive and more than 2 standard errors away from 0, and the coefficients are statistically significant with p < .05 for failures and p < .001 for absences and sex.
From the odds ratios we can see that, when the other predictors are taken into account, the odds of high alcohol consumption are about 1.6 times higher for each past class failure, about 1.1 times higher for each additional absence, and about 2.6 times higher for male students than for female students.
Conclusion: As expected, class failures, school absences, and student's sex predict high alcohol consumption. Male students are more likely to be high alcohol consumers. Class failures and absences also increase the probability of high consumption. However, mother's education doesn't seem to predict alcohol use consistently.
# predict() the probability of high_use
probabilities <- predict(mod1, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, Medu, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## Medu failures absences sex high_use probability prediction
## 373 Primary 1 0 M FALSE 0.45259271 FALSE
## 374 Higher 1 7 M TRUE 0.53891660 TRUE
## 375 5th to 9th 0 1 F FALSE 0.07708996 FALSE
## 376 Higher 0 6 F FALSE 0.20538904 FALSE
## 377 5th to 9th 1 2 F FALSE 0.12421781 FALSE
## 378 Secondary 0 2 F FALSE 0.18968092 FALSE
## 379 Primary 2 2 F FALSE 0.36728009 FALSE
## 380 Primary 0 3 F FALSE 0.21181023 FALSE
## 381 Secondary 0 4 M TRUE 0.43043082 FALSE
## 382 Secondary 0 2 M TRUE 0.38399298 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 257 11
## TRUE 78 36
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67277487 0.02879581 0.70157068
## TRUE 0.20418848 0.09424084 0.29842932
## Sum 0.87696335 0.12303665 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2329843
The average proportion of wrong predictions in the data is 23% (note that the predictions above were computed from Model 1, which also includes mother's education alongside sex, absences, and class failures). This means that the prediction was right about three times out of four, which is clearly better than random guessing (an error rate of 50%). The model was especially accurate at predicting low alcohol consumption, predicting it correctly in 257 cases out of 268. However, the model misclassified most of the cases where alcohol consumption was high.
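The accuracy and the class-specific performance discussed above can be computed directly from the confusion matrix; the sketch below rebuilds the matrix from the counts printed earlier:

```r
# Classification measures from the 2x2 confusion matrix
cm <- matrix(c(257, 78, 11, 36), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(cm)) / sum(cm)                    # (257 + 36) / 382
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # 257 / 268
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # 36 / 114
round(c(accuracy, specificity, sensitivity), 2)  # 0.77 0.96 0.32
```

The low sensitivity (0.32) is the numeric counterpart of the observation that most high-consumption cases were misclassified.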